A system for processing information contained in a collection of text-based information sources
employs associative and linguistic expansion of input words in which associative expansion is first
performed, followed by simultaneous linguistic expansion in accordance with related morphological
and phonetic rules. The system automatically generates and updates a linguistic knowledge base for
each language to be processed by analyzing a large body of text in each language. The system also
automatically indexes the collection of text-based information sources to be searched. A method is
provided to expand a word or term in a supported language using a two-dimensional (2D) expansion
matrix providing great flexibility, high accuracy and low noise output. The 2D expansion matrix
includes an associative dimension that utilizes thesauri, databases of saved queries and other
associated information sources, in which words are related to other words by meaning and relations,
and a linguistic dimension which utilizes recognition [illegible], in which words are related to
other words by combined rules for morphological and phonetic variation.



[Front-page figure: block diagram. Legible labels: SOURCES, KNOWLEDGE BASE, LOOKUP.]

PAGE 31/97 * RCVD AT 1/2912004 3:21:23 PM [Eastern Standard Time] « SVR:USPTO-EFXRM/0' DNIS:8729306 * CSID:3176346701 * DURATION (mm-ss):31-36 






PCT/IB97/00748 

A SYSTEM, SOFTWARE AND METHOD [...]
COLLECTION OF TEXT BASED INFORMATION SOURCES

BACKGROUND

1. Field of the Invention 

The present invention relates generally to the field of information retrieval. More 
particularly, the present invention relates to information management systems and computational 
linguistic systems for finding information related to a user-input query, in a collection of
text-based information sources.



2. Discussion of Related Art 

In the Information Age, the ability to manage enormous volumes of information
efficiently and find needed information quickly has become a driving force in all human
endeavors. Early in the development of information management systems, the capability to
process large volumes of free-form text documents and other text-based information sources was
severely limited. Therefore, information specialists developed various types of database
management systems and searching systems based on strictly controlling how data may be
received, stored, and referred to. However, as the volume and nature of the information which
must be handled by such systems has expanded, conventional database management systems
have been unable to keep pace.

In conventional database management systems, data is stored in a strictly structured
environment. Such systems may be based upon tables of records or spreadsheet models, for
example. Such systems may be flat or may be relational with respect to how records in the
database are associated with each other. However, conventional database management systems
generally require structured records in which one or more fields may be searchable, i.e. are key
fields. Furthermore, it is desirable that such key fields use terms, e.g. numbering systems, labels,
etc., in a consistent manner which facilitates searching with known query values, i.e.
combinations of numbers, labels, etc.

In order to locate information within general text-based information sources, so-called
full-text searching has developed. Full-text searching of a collection of text-based information
sources, such as English-language documents stored in a computer system, permits a user to
write a query containing terms known to be used in relevant documents. The collection of
documents is first fully indexed and the words of the documents in the index are compared with
the query terms. In the simplest form of this type of system, an exact match between a query
term and an index entry must be found in order to identify a relevant document. Spelling errors,
word variants, etc. will tend to prevent finding all relevant documents. A technique called
wild-carding may be used to partially alleviate this problem, but many irrelevant documents,
referred to as "noise," often turn up when wild-carding is used. An example of the use of
wild-carding is where a user query term includes only what the user has identified to be a word
base of a relevant term, such as "comput*" for the concept of "compute," "computer,"
"computing," "computation," etc., where "*" indicates the portion of the term which has been
left out.

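By way of illustration only, the noise behavior of wild-carding may be sketched in Python as
follows. The function name `wildcard_match` and the sample vocabulary are assumptions introduced
for this sketch and form no part of any system discussed above; only a single trailing "*" is
handled.

```python
def wildcard_match(pattern, vocabulary):
    """Return the vocabulary words matched by a pattern with an optional trailing '*'."""
    if not pattern.endswith("*"):
        # No wildcard: only an exact match identifies a relevant word.
        return [word for word in vocabulary if word == pattern]
    base = pattern[:-1]  # the word base supplied by the user
    return [word for word in vocabulary if word.startswith(base)]


# Hypothetical index vocabulary, chosen for illustration.
VOCABULARY = ["compute", "computer", "computing", "computation", "company"]

print(wildcard_match("comput*", VOCABULARY))
print(wildcard_match("comp*", VOCABULARY))
```

In this sketch a shorter word base such as "comp*" also matches "company", which is the kind of
irrelevant result referred to above as noise.
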
Modern, conventional, full-text searching systems have been developed which have a
much higher level of sophistication. For example, Pinkas, G., Natural Language Full-Text
Retrieval System, Master's Thesis, University of Jerusalem, 1985, discloses a system which
automatically expands a user's query to include additional relevant terms in a manner more
noise-free than simple wild-carding. The Pinkas system: (1) receives a user query composed of
query-words and boolean operators; (2) expands the query linguistically, i.e. by referring to a
pre-processed database of morphological and phonetic information; (3) expands the query
associatively, i.e. by referring to a database of associated sub-queries; and (4) merges the results
of steps 2 and 3 above. Morphological expansion draws in the infix variations of the query
terms, while phonetic expansion brings in terms that may be generated by misspelled vowels
(e.g. recieve - receive). Associative expansion draws into the query related terms as predefined
by the user in the form of sub-queries being associated with a specific query-word (e.g. to
associate the acronym "USA" with its full wording, one creates an association between the word
"USA" and a query applying a boolean "and" operation to the following 4 words: "United",
"States", "of", "America", restricted to a proximity of 1 word distance). Thus a comprehensive
expanded query is generated to cover the many different words that may conceptually be related
to the user's original query. Some variation in the level of morphological expansion and the level of
phonetic expansion to be performed is available to the user by selection of expansion parameters.

However, this process of morphological and phonetic expansion suffers from many
inefficiencies: it fails to recognize the fundamental differences between different "word-bases"
such as morphological stems and phonemes, therefore it misses many relevant linguistic
permutations affected by both mechanisms, and at the same time it generates a large amount of
noise, i.e. false-positives, due to the combinatorial effect of combining both mechanisms.
Moreover, this process also is fairly limited to recognizing and expanding single words, and even
then the interaction between the associative expansion and the linguistic expansion is fairly
limited to a trivial merge of both results, having not shared a conceptual foundation that allows a
mutual feedback (e.g. the query-word "airplane" expands to ("airplane" or "airplanes" or
"aircraft") but not to "aircrafts").

In conventional systems, query expansion depended upon a set of linguistic rules which
were developed by an expert in the language to be processed. The set of linguistic rules was both
extensive and relatively inflexible, since as many characteristics of the input language as possible
had to be accounted for before processing any text-based information sources. Development of
the linguistic rules for each language to be processed was a very labor-intensive and
time-consuming task.

Finally, conventional systems are known which require manual indexing of text sources,
as well as which index text sources automatically. Conventional indexes simply map a word
found in the text sources to a location at which the word is found. Manual full-text indexing is
extremely time-consuming and error-prone. Keyword indexing is subjective and also somewhat
error-prone.

SUMMARY OF THE INVENTION

Therefore, it is a general aim of the present invention to solve the problems noted above
with respect to the prior art. Aspects of the present invention solving the problems of the prior
art include at least a system, software and a method for processing information contained in a
collection of text-based information sources.

The system may include a computer or data processor and software structured as one or
more software modules, units or functions which when executed in a specified order by the
computer or data processor perform the desired information processing task. One or more
software modules, units or functions may be made available in conventional manner as either
compile-time or run-time library entries which may be referred to by a software program which
is written in a manner to be aware of such a library. The present invention further provides a
method to process query-concepts and transform them to an expanded/improved query using an
expansion matrix providing great flexibility, high accuracy and low noise output.

According to one aspect of the invention, there may be provided a text-based information
processing system, comprising an automatic linguistic knowledge base generator having an input
receiving a collection of text-based information sources and which produces a linguistic
knowledge base; an index generator having inputs receiving a collection of text-based
information sources and the linguistic knowledge base and which produces an index of the
received text-based information and further which updates the linguistic knowledge base to
reflect the inputs to the index generator and maintain correlation between the index and the
linguistic knowledge base; a query processor having inputs receiving a query composed by an
operator, the linguistic knowledge base, the index and a thesaurus and which produces a list of
locations in the collection of text-based information sources relevant to the query. The
text-based information processing system may be subject to numerous modifications and variations.
For example, the automatic linguistic knowledge base generator, the automatic index generator
and the query processor may be embodied in various ways.

In accordance with another aspect of the invention, in a text-based information processing
system, an automatic linguistic knowledge base generator may comprise a parser, receiving an
input stream of terms and producing individual terms; a language recognizer connected to
receive the individual terms from the parser and which produces an output indicative of a
language to which each individual term belongs; a normalizer connected to receive the individual
terms and further to receive linguistic rules for the language indicated by the output of the
language recognizer and producing normalized terms; and a linguistic expander connected to
receive the legal individual terms and producing entries stored in the linguistic knowledge base.

In accordance with yet another aspect of the invention, in a text-based information
processing system, an automatic index generator may comprise a parser, receiving an input stream
of terms and producing individual terms; a language recognizer connected to receive the individual
terms from the parser and which produces an output indicative of a language to which each
individual term belongs; a normalizer connected to receive the individual terms and further to
receive linguistic rules for the language indicated by the output of the language recognizer and
producing normalized terms; and an index entry generator connected to receive the legal
individual terms and producing entries stored in the index when the terms have not previously
been indexed and modifying an existing index entry when the terms have previously been
indexed.

Finally, in accordance with yet another aspect of the invention, in a text-based
information processing system, an expansion unit for expanding terms in a language may
comprise an associative expander having an input receiving a term and having an output
representing the term and at least one associated term found by the associative expander making
reference to a thesaurus; and a linguistic expander having an input connected to the output of the
associative expander and having an output representing the input of the linguistic expander and
at least one term linguistically related to the input of the linguistic expander and found by
reference to a linguistic knowledge base for the language.
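
By way of illustration only, the chaining of the two expanders may be sketched in Python as
follows. The function names, the dictionary-based thesaurus and the dictionary-based knowledge
base are assumptions made for this sketch; in particular, the knowledge base is simplified here
to a direct mapping from a term to its linguistically related terms.

```python
def associative_expand(term, thesaurus):
    """Return the term followed by its thesaurus-associated terms, without duplicates."""
    result = [term]
    for related in thesaurus.get(term, []):
        if related not in result:
            result.append(related)
    return result


def linguistic_expand(terms, knowledge_base):
    """Return each input term followed by its linguistically related terms."""
    result = []
    for term in terms:
        for candidate in [term] + knowledge_base.get(term, []):
            if candidate not in result:
                result.append(candidate)
    return result


def expand(term, thesaurus, knowledge_base):
    """Associative expansion first; its output is the input of linguistic expansion."""
    return linguistic_expand(associative_expand(term, thesaurus), knowledge_base)


# Hypothetical data, chosen for illustration.
THESAURUS = {"airplane": ["aircraft"]}
KNOWLEDGE_BASE = {"airplane": ["airplanes"], "aircraft": ["aircrafts"]}

print(expand("airplane", THESAURUS, KNOWLEDGE_BASE))
```

Because the linguistic expander operates on every term output by the associative expander, an
associated term such as "aircraft" is itself expanded in this sketch.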

The normalizers recited above may be constructed of two units. The first normalizer unit
may be connected to receive the individual terms and the linguistic rules and producing terms
from which illegal characters have been removed; and the second normalizer unit may then be
connected to receive the terms from which illegal characters have been removed and the
linguistic rules and which produces normalized terms including word stems found by applying
the linguistic rules to the terms from which illegal characters have been removed.

The present invention will be better understood by reading the Detailed Description of at 
least one illustrative embodiment of the invention, in connection with the attached drawing. 

BRIEF DESCRIPTION OF THE DRAWINGS

In the drawings, in which like reference designations indicate like elements,

Fig. 1 is a schematic block diagram of a computer or data processing system on which the
present invention may be practiced;

Fig. 2 is a schematic block diagram of the memory of Fig. 1;

Fig. 3 is a flow chart of automatic linguistic knowledge base generation;

Fig. 4 is a flow chart of automatic index generation;

Fig. 5 is a flow chart of query expansion; and

Fig. 6 is a flow chart of an information retrieval system including the features illustrated
in Figs. 3-5.

DETAILED DESCRIPTION

In order to better understand the following detailed description, reference should be made 
to the following definitions. In this discussion a "language" is considered to be any organized
system of tokens, which have symbolic meaning. For convenience, the tokens are referred to
hereinafter as "words" or "terms," since the most common types of languages dealt with by
text-based information systems are natural human languages composed of words or combinations of
words, i.e. terms, which are understood to have specific meanings by humans. Thus, the terms
"words" and "terms" are intended to encompass word phrases in those instances where a word
[...] term or word base is associated [...] other words [...] having a defined relationship such
as morphological proximity, phonetic [...] related term in a specific context, etc. [...] the
terms, words and/or word bases stored therein.

Languages considered here have known linguistic rules for the morphological and
phonetic variances which words may undergo. For example, the morphological rules of a
language may define how plurals are formed from singular nouns by changing the shape of the
word, i.e. adding a final "s" in English, while the phonetic rules may represent the common
[...] Arabic. [...] a software program to hold a list of such linguistic rules.

Languages [...] rules [...] language. For example, the English language morphological rule
for generating the past tense of a verb does not apply to the verb "to go," which becomes "went"
rather than [...]. Therefore, exceptions to the rules may be held by a program in a table of
exceptions, so that words which do [...] which do obey the rules. In accordance with [...] of the
present invention, a linguistic knowledge base [...] developed by applying the linguistic rules
and the one or more tables of exceptions [...] representation of the variations of word bases
which produce meaning, [...] languages, but also in language generally. The "linguistic
knowledge base" is a table [...] of word bases and related words. Related words are those words
which when [...] under the linguistic rules for the language are determined to have the same
word base.
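
By way of illustration only, the interplay of a morphological rule and a table of exceptions may
be sketched in Python as follows. The names `PAST_TENSE_EXCEPTIONS` and `past_tense` are
assumptions made for this sketch, and the rule shown is deliberately simplified (it ignores, for
example, consonant doubling and the change of a final "y").

```python
# Hypothetical table of exceptions: irregular verbs mapped to their past tense.
PAST_TENSE_EXCEPTIONS = {"go": "went"}


def past_tense(verb, exceptions=PAST_TENSE_EXCEPTIONS):
    """Apply a simplified English past-tense rule unless the verb is a listed exception."""
    if verb in exceptions:
        # Irregular word: the table of exceptions takes precedence over the rule.
        return exceptions[verb]
    if verb.endswith("e"):
        return verb + "d"
    return verb + "ed"


print(past_tense("go"), past_tense("walk"), past_tense("bake"))
```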

The present invention is [...] processing systems. An overview of such systems is given in
connection with the block diagram of Fig. 1. A computer system or data processing system
generally includes a processor 101, memory 103, one or more input devices 105, and one or more
output devices 107, all interconnected through an interconnection mechanism 109. Many variations
of this basic plan are possible. For example, viable systems may lack input devices 105 and
output devices 107, communicating entirely through interactions with the memory 103 by external
devices (not shown). Also, distributed computer systems and data processing systems are
contemplated as falling within this basic plan. The interconnection mechanism 109 may be an
internal system bus of a personal computer or may be the Internet, through which a processor 101
interacts with a database stored on a remote memory 103. Other variations will be evident to
those skilled in this art.

Memory 103 may be classified into two categories useful to this discussion, long term
memory (also called non-volatile memory), and short term memory (also called volatile
memory). These two types of memory are often both used in computer systems and data
processing systems, as shown in Fig. 2. Volatile memory 201 such as integrated circuit random
access memory (RAM) is often used in close physical proximity to the processor 101 because the
technologies in which such volatile memory 201 is most readily realized produce fast access
times, such as are desirable to support fast processors 101. Non-volatile memory 203 is often
used to store massive quantities of data for longer periods of time because it can be more cheaply
constructed than volatile memory of a similar capacity. Non-volatile memory 203 is often
implemented as magnetic or optical disk or tape storage units, which provide a further advantage
of data and software program interchange between different computer or data processing
systems. As such, non-volatile memory 203 may be a software product disk on which are
recorded signals representing instructions, which when executed by a processor 101 cause the
computer or data processing system to perform a special purpose function. Software embodying
aspects of the present invention may be recorded on such a non-volatile memory 203 for
distribution by a manufacturer, for archival purposes, for access through a volatile memory 201
by a processor 101, etc.

In accordance with various aspects of the present invention, there may be constructed a
system for searching through and locating information in a collection of text-based information
sources. In accordance with various aspects of the invention, a linguistic knowledge base is first
generated. Then, the collection of text-based information sources is indexed. A user next inputs
a query defining the information sought. The query is expanded according to selected
associative and linguistic rules, using a thesaurus and the linguistic knowledge base. Finally,
information is identified which matches the various expanded query terms.

The thesaurus, linguistic knowledge base and index may be stored in one or more 
computer files to which the system has access through memory 103. 

The aspects of the invention connected with automatic generation of the linguistic
knowledge base, automatic generation of the index and query expansion are next described in
detail.

I. Automatic Generation of the Linguistic Knowledge Base

According to one aspect of the invention, software as shown in Fig. 3 is provided which
when executed by a suitable data processing system will automatically generate the linguistic
knowledge base from an input body of text based information sources. For example, according
to this aspect of the invention, a collection of English language documents may be processed to
generate an English language linguistic knowledge base.

A small set of linguistic rules 301, including a list of exceptions 302 to the linguistic rules
for a language, e.g. English, is first generated by statistical analysis of a large body of text based
information. This small set of rules includes:

• a list of irregular words and word bases in the language, i.e. the list of exceptions noted
  above;
• a word normalization table specifying legal characters in the language, i.e. the alphabet of
  the language, and legal character positions in the language, e.g. special rules concerning
  characters which can only appear at specific locations within a word;
• a prefix and suffix list specifying legal prefixes and legal suffixes in the language; and
• letter-to-sound rules for both ordinary words and proper names in the language.

This set of rules 301, including the list of exceptions 302, is then used to analyze a body of text
based information sources 303, to generate a linguistic knowledge base 305 specifically adapted
from the body of text based information sources 303. The body of sources 303 may be selected
to be sources from a particular field of endeavor in which future queries are expected to be made,
for example. This will result in a linguistic knowledge base better able to cope with the specifics
of that particular field of endeavor. The body of sources 303 from which the linguistic
knowledge base 305 is derived may not be the same body of sources which is ultimately to be
searched. However, automatically generating the linguistic knowledge base 305 from the body
of sources to be searched has the advantage that the linguistic knowledge base 305 so produced is
particularly well adapted to the body of sources to be searched.

Automatic generation of the linguistic knowledge base 305 proceeds as follows. The
body of text based information sources 303 forms an input stream of text 304 to the system. This
input stream 304 is first parsed into words and terms 307 in accordance with either fixed word
recognition rules or word recognition rules specific to one or more languages. The language of
each of the words parsed from the input stream is then recognized 309. Once the language of a
word has been recognized the word may be normalized 311 according to the linguistic rules 301
for the language. Irregular words may also be recognized at this point, since known irregular
words are already in the list of irregular words 302 and hence need no further processing. The
system may also identify as potential new irregular words, those words meeting some rule-based
criteria. Those previously unknown irregular words may be identified to a human operator for a
determination of whether they should be added to the list of irregular words. Regular words are
linguistically expanded 313 before being added to the linguistic knowledge base 305 such that
word bases are stored in the linguistic knowledge base 305 along with a list of related words
from the body of sources 303. Linguistic expansion 313 is discussed in greater detail below.

The step of parsing 307 the input stream 304 into sentences and words takes place
according to the following pseudo-code:

    load segmentation rules;
    segment input stream into sentences and words using segmentation rules;
    for each sentence
    {
        for each word
        {
            if language not explicitly specified
                identify word's language
            end if;
            normalize word;
            return word and word's coordinates;
            return rest of input stream;
        } next word;
    } next sentence.
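
By way of illustration only, the segmentation portion of the above pseudo-code may be sketched in
Python as follows. The function name `parse_stream`, the regular expressions standing in for the
segmentation rules, and the use of a (sentence number, word position) pair as the word's
coordinates are assumptions made for this sketch; language identification and normalization are
omitted.

```python
import re

# Assumed segmentation rules: sentences end at '.', '!' or '?'; words are runs of letters.
SENTENCE_BREAK = re.compile(r"[.!?]+")
WORD = re.compile(r"[^\W\d_]+")


def parse_stream(text):
    """Yield (word, (sentence_number, position)) for every word of the input stream."""
    sentences = [s for s in SENTENCE_BREAK.split(text) if s.strip()]
    for sentence_number, sentence in enumerate(sentences):
        for position, word in enumerate(WORD.findall(sentence)):
            yield word, (sentence_number, position)


for word, coordinates in parse_stream("The cat sat. It purred!"):
    print(word, coordinates)
```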

Normalization is performed as follows. Normalization identifies and removes garbage characters
from the words of the input stream 304.

    for each character of an input word
    {
        if the character is illegal
        {
            set normalization status according to the character;
        }
        else
        {
            translate the character to the internal alphabet;
            add the translated character to the output word;
        }
    } next character.
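
By way of illustration only, the above normalization loop may be sketched in Python as follows.
The function name `normalize`, the dictionary `ENGLISH` serving as the word normalization table,
and the choice of lower-case letters as the internal alphabet are assumptions made for this
sketch; the normalization status is represented simply as the list of illegal characters seen.

```python
import string

# Assumed normalization table: each legal character mapped to the internal alphabet.
ENGLISH = {c: c for c in string.ascii_lowercase}
ENGLISH.update({c: c.lower() for c in string.ascii_uppercase})


def normalize(word, table):
    """Return (output_word, status), where status lists the illegal characters removed."""
    output = []
    status = []
    for character in word:
        translated = table.get(character)
        if translated is None:
            status.append(character)   # illegal character: record it, do not emit it
        else:
            output.append(translated)  # legal character: emit its internal form
    return "".join(output), status


print(normalize("Re#ceive!", ENGLISH))
```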

Finally, new keys are added to the linguistic knowledge base 305 by the following procedure.

    if language not explicitly specified
        identify the language of the input word;
    for each recognition type
    {
        analyze word according to recognition type and level;
        search for analysis results in key table;
        if result is found in key table
        {
            next recognition type
        }
        else
        {
            insert key and word into table
            if key has a legal sub-key
            {
                activate linguistic correction mechanism;
            }
        }
    } next recognition type.
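
By way of illustration only, the above key-insertion procedure may be sketched in Python as
follows. The function names, the dictionary used as the key table, and the two toy analyzers
(`strip_plural` and `drop_vowels`) are assumptions made for this sketch; language identification
and the sub-key check that activates the linguistic correction mechanism are omitted.

```python
def strip_plural(word):
    """Toy morphological analyzer: remove a final 's'."""
    return word[:-1] if word.endswith("s") else word


def drop_vowels(word):
    """Toy phonetic analyzer: remove vowels."""
    return "".join(c for c in word if c not in "aeiou")


# Assumed recognition types, each with its analyzer.
ANALYZERS = {"morphological": strip_plural, "phonetic": drop_vowels}


def add_word(word, key_table, analyzers):
    """Insert the word under each new key; return the keys created by this call."""
    new_keys = []
    for recognition_type, analyze in analyzers.items():
        key = (recognition_type, analyze(word))
        if key in key_table:
            continue  # result found in key table: next recognition type
        key_table[key] = [word]
        new_keys.append(key)
    return new_keys


table = {}
print(add_word("cats", table, ANALYZERS))
print(add_word("cat", table, ANALYZERS))
```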



Two useful recognition types subject to analysis as indicated in the above pseudo-code are
morphological and phonological. The morphological analyzer of the described embodiment
functions in accordance with the following procedure. The morphological analyzer receives a
list of valid prefixes and suffixes in the language identified for the input word.

    start at end of word;
    strip next substring from end;
    for each substring of word
    {
        /* search for prefix */
        if substring is found to be a prefix in the identified language
        {
            strip prefix - create initial stem;
            /* search for suffix */
            start at beginning of stem;
            strip next substring from beginning of stem;
            for each substring of stem
            {
                if substring is found to be a suffix
                {
                    strip suffix - create stem;
                    return stem;
                } endif;
            } next substring;
        } endif;
    } next substring.
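
By way of illustration only, prefix and suffix stripping may be sketched in Python as follows.
This is a simplified longest-match variant and does not reproduce the exact control flow of the
above pseudo-code; the function name `strip_affixes`, the `min_stem` parameter and the sample
affix lists are assumptions made for this sketch.

```python
def strip_affixes(word, prefixes, suffixes, min_stem=3):
    """Strip at most one legal prefix and one legal suffix, longest match first."""
    stem = word
    for prefix in sorted(prefixes, key=len, reverse=True):
        if stem.startswith(prefix) and len(stem) - len(prefix) >= min_stem:
            stem = stem[len(prefix):]  # strip prefix: create initial stem
            break
    for suffix in sorted(suffixes, key=len, reverse=True):
        if stem.endswith(suffix) and len(stem) - len(suffix) >= min_stem:
            stem = stem[:-len(suffix)]  # strip suffix: create stem
            break
    return stem


# Hypothetical affix lists for English, chosen for illustration.
PREFIXES = ["re", "un"]
SUFFIXES = ["ing", "ed", "s"]

print(strip_affixes("recomputing", PREFIXES, SUFFIXES))
```

The `min_stem` guard is a design choice of this sketch: it keeps a short word such as "red" from
being reduced to a meaningless remainder.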

The phonetic analyzer converts each word into a phonological representation of the word on the
basis of letter to sound rules. Words having similar or same phonological representations may be
considered to be related by their phonetic morphology.
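
By way of illustration only, a phonological representation may be sketched in Python as follows.
The single letter-to-sound rule used here (every vowel maps to one symbol "A", and adjacent
repeated symbols are collapsed) is an assumption made for this sketch and is far cruder than the
letter-to-sound rules contemplated above; the function name `phonetic_key` is likewise
hypothetical.

```python
def phonetic_key(word):
    """Map vowels to a single symbol and collapse adjacent repeated symbols."""
    key = []
    for character in word.lower():
        symbol = "A" if character in "aeiou" else character
        if not key or key[-1] != symbol:
            key.append(symbol)
    return "".join(key)


print(phonetic_key("recieve"), phonetic_key("receive"))
```

Under this toy rule a word with misspelled vowels, such as "recieve", receives the same
representation as "receive", so the two may be treated as related.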

When the above processes have been completed for the body of text based information
sources 303 initially presented, a linguistic knowledge base 305 for the languages of the text in
the body of sources 303 will have been automatically generated. When new text based
information sources are added to the system, they are also processed as above. Thus, new
sources increase the knowledge and accuracy of the linguistic knowledge base 305 through the
addition of new information to the linguistic knowledge base 305, as well as through the
linguistic correction mechanism which corrects the contents of individual entries in the linguistic
knowledge base 305 according to new information. The learning procedure which embodies the
linguistic correction mechanism is as follows.

    for each word
    {
        re-analyze word;
        if analysis results match the new correct key
        {
            delete word from body of previous key entry;
            add word to body of new correct key entry;
        }
    } next word;
    if previous key entry is empty
        delete previous key entry.
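
By way of illustration only, the above learning procedure may be sketched in Python as follows.
The function name `relink`, the dictionary used as the key table, and the toy `analyze` function
are assumptions made for this sketch.

```python
def relink(previous_key, new_key, key_table, analyze):
    """Move words whose re-analysis yields new_key out of the previous key entry."""
    for word in list(key_table.get(previous_key, [])):
        if analyze(word) == new_key:
            key_table[previous_key].remove(word)
            key_table.setdefault(new_key, []).append(word)
    # Delete the previous key entry if no words remain under it.
    if previous_key in key_table and not key_table[previous_key]:
        del key_table[previous_key]


def analyze(word):
    """Toy re-analysis: words beginning with 'run' share the word base 'run'."""
    return "run" if word.startswith("run") else word


table = {"runn": ["running", "runner"]}
relink("runn", "run", table, analyze)
print(table)
```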

When the system detects inconsistencies between the contents of the linguistic knowledge base
305 and a newly presented text source, the affected word base and list of related words may be
automatically, or at the direction of a human operator, reanalyzed and updated in accordance
with the newly presented information and the above procedure. Thus, the system constantly
learns about each language processed and updates the affected linguistic knowledge bases.

In addition to the linguistic knowledge base 305, the retrieval system according to
another aspect of the present invention shown in Fig. 4 automatically generates an index 401,
whereby text based information may be found by reference to the index 401. Automatic
generation of the index 401 is accompanied by updating of the linguistic knowledge base 305, so
that the contents of the linguistic knowledge base 305 reflect the relevant terms contained in the
body of text based information sources 303 and are thus correlated with the index 401. The index
401 simply relates words actually found in the body of text based information sources 303 to
locations within the body of sources 303. It is preferred that the location be defined
hierarchically. For example, the location may be represented hierarchically by a document
number, section number, sentence number and position number. Other hierarchical location
identification schemes may be used, as seen fit by those skilled in the art.

In accordance with a preferred embodiment of the invention, the index 401 is assisted by
the linguistic knowledge base 305. The index 401 includes only words and terms actually
occurring in the text-based information sources 303. The linguistic knowledge base 305 relates
word bases derived from the words actually occurring in the text-based information sources 303
to lists of related words. During retrieval, which is explained below, the system retrieves an
entry from the linguistic knowledge base 305 which is then used to reference one or more index
entries.
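
By way of illustration only, the use of a linguistic knowledge base entry to reference index
entries may be sketched in Python as follows. The function name `lookup`, the dictionaries
standing in for the linguistic knowledge base and the index, and the two-number locations are
assumptions made for this sketch.

```python
def lookup(word_base, knowledge_base, index):
    """Collect the locations of every word related to the given word base."""
    locations = []
    for word in knowledge_base.get(word_base, []):
        # Each related word references its own index entry, if one exists.
        locations.extend(index.get(word, []))
    return sorted(locations)


# Hypothetical data, chosen for illustration.
KNOWLEDGE_BASE = {"comput": ["compute", "computer"]}
INDEX = {"compute": [(1, 2)], "computer": [(0, 5), (3, 1)], "cat": [(9, 9)]}

print(lookup("comput", KNOWLEDGE_BASE, INDEX))
```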

Automatic generation of the index 401 proceeds as follows. The body of text based
information sources 303 forms an input stream of text 304 to the indexing subsystem. This input
stream 304 is first parsed into words and terms 307 in accordance with the word recognition
rules. The language of each of the words parsed from the input stream is then recognized 309.
Once the language of a word has been recognized 309 the word may be normalized 311
according to the linguistic rules 301 for the language. An index entry is then generated 403 for
each new normalized word. If the normalized word already has an entry in the index 401, then
the location of the current occurrence of the word is added to the previous entry.
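
By way of illustration only, index entry generation with hierarchical locations may be sketched
in Python as follows. The names `Location` and `build_index`, and the representation of the
index as a dictionary of location lists, are assumptions made for this sketch; the input is
assumed to be a sequence of already normalized words paired with their locations.

```python
from collections import defaultdict
from typing import NamedTuple


class Location(NamedTuple):
    """Hierarchical location: document, section, sentence and position numbers."""
    document: int
    section: int
    sentence: int
    position: int


def build_index(occurrences):
    """Map each normalized word to the list of locations at which it occurs."""
    index = defaultdict(list)
    for word, location in occurrences:
        # A new word creates an entry; a known word extends its existing entry.
        index[word].append(location)
    return dict(index)


occurrences = [
    ("query", Location(1, 1, 1, 1)),
    ("index", Location(1, 1, 1, 2)),
    ("query", Location(2, 1, 3, 4)),
]
print(build_index(occurrences))
```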

At substantially the same time as the above process, the linguistic knowledge base 305 is
continuously kept correlated with new and modified entries produced in index 401. Each
normalized word is reduced to its word base 405 in accordance with the linguistic rules of the
language of the word. The word base and related word are then added to the linguistic knowledge
base file 407, if not already present. The user may also specify that related words include
various types of expansions of the word bases. If expansions are included, expansion of the word
base is performed before storing the word base and related words in the linguistic knowledge
base file 305. When indexing of a body of text-based information sources 303 is complete, the
linguistic knowledge base 305 is correlated with the index 401 and reflects the relevant terms
contained in the body of text-based information sources 303.
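
As a rough sketch of how the knowledge base might be kept correlated with the index, the fragment below reduces each normalized word to a word base and records the word under that base only if it is not already present. The suffix list and the `word_base` rule are hypothetical simplifications standing in for the linguistic rules of a language, and the `KnowledgeBase` class is an assumed in-memory structure, not the knowledge base file of the described embodiment.

```python
from collections import defaultdict

# Hypothetical stand-in for a language's morphological rules.
SUFFIXES = ("ing", "ed", "es", "s")


def word_base(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


class KnowledgeBase:
    """Maps each word base to the set of words actually seen in the text."""

    def __init__(self):
        self.related = defaultdict(set)

    def add(self, normalized_word: str) -> bool:
        """Record the word under its base; return False if already present."""
        words = self.related[word_base(normalized_word)]
        if normalized_word in words:
            return False
        words.add(normalized_word)
        return True

    def lookup(self, word: str):
        """Return all recorded words sharing the word's base, sorted."""
        return sorted(self.related.get(word_base(word), ()))
```

Because only words that occur in the indexed text are ever added, every related word returned by `lookup` is guaranteed to have an index entry.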
III. Query Expansion

Query expansion is performed in accordance with a third aspect of the invention, shown 
in Fig. 5. Since a query may contain more than one word or term, word recognition is first 
performed as above. 

The words and terms identified by the word recognition task may further be normalized.
That is, they may be converted to a base form, if desired. By making reference to a thesaurus
and linguistic rules, spelling errors may be removed, different lexical forms of acronyms and
short-cuts may be recognized, etc.

Each recognized word in the query may then be expanded using a 2D expansion matrix.
The 2D expansion matrix is one way of defining the expansion space in which an input word
may be represented. The dimensions of this space are associative and linguistic. The associative
dimension is based upon the meaning of words/word-bases in the language to be processed. In
the described embodiment of the invention, the associative dimension is defined by one or more
thesauri 501 relating words and terms to their synonyms, broader terms, narrower terms and
other relations. Each thesaurus 501 includes a database of terms along with conceptually related
terms. The thesaurus is searchable by term. Thus, each thesaurus entry contains an entry key
which is a list of searchable terms. Each entry key has associated therewith one or more terms
conceptually related to the entry key, such as synonyms, broader terms, narrower terms,
associated terms, antonyms, etc. The inclusion of any one or more categories of association is
optional. Furthermore, each entry term may optionally have associated therewith a conventional
dictionary definition and usage guide, as well as a query string into which the entry key may be
translated when required. Thus, the thesaurus is a list of entries, wherein each entry has a
structure substantially as follows:

KEYWORD: (in the form of a natural language phrase or term) Used as an entry key.
DESCRIPTION (optional): A description of keyword meaning and usage (as in
    encyclopedic dictionaries).
QUERY (optional): A complete query statement in an underlying full-text query language that
    the keyword is translated to when required. (E.g., KEYWORD "USA" -> QUERY
    "United AND States AND of AND America".) If a translation of the keyword to a
    complete query statement is not supplied explicitly, a default translation is applied to the
    keyword.
RELATIONS:
    SYNONYMS: A list of keywords synonymous with the KEYWORD that
        comprise a concept or descriptor.
    BROADER TERMS
    NARROWER TERMS
    ASSOCIATIONS
    OTHER

All of these features may be used by an operator to determine whether associative expansion is
having a desired effect.
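
One possible in-memory representation of a thesaurus entry with the structure listed above is sketched below. The class and field names are illustrative assumptions. The text does not say what the default translation of a keyword is, so the AND-joining of the keyword's words in `to_query` is purely a guess made for the sake of a runnable example.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ThesaurusEntry:
    keyword: str                       # entry key (phrase or term)
    description: Optional[str] = None  # optional meaning and usage note
    query: Optional[str] = None        # optional explicit full-text query
    synonyms: List[str] = field(default_factory=list)
    broader_terms: List[str] = field(default_factory=list)
    narrower_terms: List[str] = field(default_factory=list)
    associations: List[str] = field(default_factory=list)
    other: List[str] = field(default_factory=list)

    def to_query(self) -> str:
        """Return the explicit query if supplied, else a default translation."""
        if self.query is not None:
            return self.query
        # ASSUMPTION: the default translation ANDs the keyword's words together.
        return " AND ".join(self.keyword.split())

    def related_terms(self) -> List[str]:
        """All terms reachable through any relation category."""
        return (self.synonyms + self.broader_terms + self.narrower_terms
                + self.associations + self.other)
```

For example, an entry built as `ThesaurusEntry("USA", query="United AND States AND of AND America")` returns its explicit query unchanged, while an entry with no `query` falls back to the assumed default.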

The linguistic dimension of this expansion space is based upon the linguistic knowledge
base 305 of the language to be processed. As discussed above, the linguistic knowledge base is
built automatically from the actual corpora of the text-based information sources, independent of
manually crafted linguistic dictionaries, and is not restricted to "legal" or "proper" words. In
this embodiment of the invention, linguistic expansion grammars of morphology and phonetics
are supported.

The expansion task performs 2D expansion in substantially two main steps. First, an
associative expansion is performed at step 503, in which each input word of an input query 505
is expanded to a list of words 507 including words having defined relations to the input word.
The associated words are found by making reference to the thesaurus 501. This expanded list of
words 507 becomes the input on which linguistic expansion 509 is performed in both the
morphological and phonetic dimensions, simultaneously. The morphological and phonetic
expansion is controlled by making reference to the linguistic knowledge base 305. The linguistic
expansion 509 may be controlled by expansion parameters 511 supplied by the user to include
varying degrees of morphological expansion and phonetic expansion, ranging for both
dimensions from no expansion in that dimension to full expansion in that dimension. By
performing the morphological and phonetic expansions as a single, linguistic expansion step 509,
expansion strategies for morphology and phonetics may be intelligently related. The
relationships between the expansion dimensions are defined in the linguistic knowledge base 305
for the language. Thus, a rule for morphological expansion may define a morphological
variation which changes depending upon the phonetic properties of the input word or the
expanded result. As a result, less noise is generated in the expanded output because relating the
morphological and phonetic dimensions as a single linguistic plane eliminates morphological
variants which are phonetically unacceptable under the totality of the linguistic rules, and vice
versa.
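
The two-step expansion can be illustrated with the toy sketch below. It is written under strong assumptions: the thesaurus is a plain dictionary, and each knowledge base entry carries a hypothetical pair of "distances" (morphological and phonetic, each from 0.0 to 1.0) so that user-supplied parameters can range from no expansion to full expansion in each dimension. The text does not specify such a scoring scheme. It is used here only to show both dimensions being tested together in a single linguistic step.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExpansionParams:
    """Degree of expansion per dimension: 0.0 = none, 1.0 = full (assumed scale)."""
    morphological: float = 1.0
    phonetic: float = 1.0


# Hypothetical data: word -> associated words.
THESAURUS = {"car": ["automobile"]}

# Hypothetical data: word -> (variant, morphological distance, phonetic distance).
KNOWLEDGE_BASE = {
    "car": [("cars", 0.3, 0.0), ("kar", 0.0, 0.5)],
    "automobile": [("automobiles", 0.3, 0.0)],
}


def associative_expand(word, thesaurus):
    """Step 1: the word itself plus its thesaurus relations."""
    return [word] + [t for t in thesaurus.get(word, []) if t != word]


def linguistic_expand(words, knowledge_base, params):
    """Step 2: one pass in which both dimensions must accept a variant."""
    expanded = []
    for word in words:
        if word not in expanded:
            expanded.append(word)
        for variant, morph, phon in knowledge_base.get(word, []):
            accepted = morph <= params.morphological and phon <= params.phonetic
            if accepted and variant not in expanded:
                expanded.append(variant)
    return expanded


def expand_query(query_words, thesaurus, knowledge_base, params):
    """Associative expansion first, then linguistic expansion of the result."""
    result = []
    for word in query_words:
        associated = associative_expand(word, thesaurus)
        for term in linguistic_expand(associated, knowledge_base, params):
            if term not in result:
                result.append(term)
    return result
```

With `ExpansionParams(1.0, 0.0)` the query `["car"]` gains morphological variants of both `car` and `automobile` but not the phonetic variant `kar`. With `ExpansionParams(0.0, 0.0)` only the associative expansion remains.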

IV. A Complete Text Retrieval System

It can now be seen that, using the software described above, a retrieval system may be
constructed as shown in Fig. 6, which can perform efficient and accurate location of information
within a collection of text-based information sources. Briefly, such a system is given access to
one or more collections of text-based information sources 303a and 303b. At least one group of
text-based information sources 303a is supplied to automatic linguistic knowledge base
generating software 601, which generates the linguistic knowledge base 305 as described above.
Text-based information sources 303b are provided to an indexing subsystem 603 which creates
an index 401 of words in the text-based information sources 303b, in which each entry in the
index 401 defines a relationship between a word and the location of the word in the collection, as
described above. It is preferred that the index 401 be generated using normalized words in one
or more languages for which the system has a thesaurus 501 and a linguistic knowledge base
305. The indexing subsystem 603 may include a module to recognize words having forms which
conform to one of the languages supported by the system and may further include an appropriate
normalizing module for each language supported by the system. Words are normalized in their
language as discussed above, to reduce the number of anomalous entries appearing in the index
401. The system further receives a user query 505 in the form of one or more words expressive
of the information sought by the user. The query words are expanded 605 using a 2D expansion
matrix, as discussed above. The query is first associatively expanded to include words related to
the original query words by reference to the thesaurus appropriate for the language of the query
words. The associatively expanded query is then linguistically expanded in both the
morphological and phonetic dimensions, simultaneously. The degree of expansion in each
dimension is specified by the user, by parameters 511 supplied with the query. The degree of
expansion may be specified by the user, for example, by attaching a checklist of expansion
parameters 511 to each query term. Finally, the terms of the fully expanded query 607 are
compared 609 with the entries in the index 401 to find relevant locations 611 within the
collection of text-based information sources 303b.
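
The final comparison step can be pictured as a simple lookup of each expanded term in the index. The sketch below assumes the index is a dictionary from normalized word to a list of location tuples (document, section, sentence, position). The function name and the data layout are illustrative assumptions rather than details taken from the described system.

```python
def find_locations(expanded_terms, index):
    """Collect the locations of every expanded query term present in the index."""
    found = set()
    for term in expanded_terms:
        # Terms with no index entry contribute nothing.
        found.update(index.get(term, ()))
    return sorted(found)


# Hypothetical index: normalized word -> (document, section, sentence, position).
example_index = {
    "car": [(0, 0, 0, 1)],
    "automobiles": [(1, 0, 2, 4)],
    "boat": [(2, 0, 0, 0)],
}

# Expanded form of a one-word query "car" (assumed output of query expansion).
example_terms = ["car", "cars", "automobile", "automobiles"]
example_hits = find_locations(example_terms, example_index)
```

Here document 1 is returned even though it contains only `automobiles`, a term produced by expansion rather than typed by the user.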

Relevant locations 611 in the collection of text-based information sources 303b do not
necessarily contain any of the original query terms. By the processing described above, the
locations found 611 will contain one of the original query terms or a related term produced by
the associative and linguistic expansion processes. The locations found 611 will not include
many "noise" locations because the linguistic expansion process is performed as described above
in a manner in which the morphological and phonetic linguistic rules are applied simultaneously
in a synergistic manner that avoids the problem of applying a morphological rule to generate a
phonetically nonsensical result or vice versa.
In a system such as described above, the text-based information sources may be text
documents stored on a computer system. In this case, it may be convenient for the indexing
system to hierarchically refer to locations by document number, section number, sentence
number and position within the sentence. Furthermore, freely formatted text documents may be
processed by the above-described system. There is no need to structure the documents in a
particular way or to manually produce classifications or keywords, as done in some prior art
systems, because the present system indexes words and manipulates queries according to the
rules of the language in which the words occur.

If it is desired that a phrase be treated as a single word or term in a particular language,
then that phrase may be so defined as a conceptual entity in a thesaurus. In all other respects, the
phrase so defined as a word is treated simply as a word in the language. However, it is
unnecessary to declare a long list of accepted keywords because the process of indexing and
query expansion generates accurate, relatively noise-free matches for user queries reasonably
expressive of the information sought.

Having thus described at least one illustrative embodiment of the invention, various
alterations, modifications and improvements will readily occur to those skilled in the art. Such
alterations, modifications and improvements are intended to be within the spirit and scope of the
invention. Accordingly, the foregoing description is by way of example only and is not intended
as limiting. The invention is limited only as defined in the following claims and the equivalents
thereto.

CLAIMS

1. A text-based information processing system, comprising:

an automatic linguistic knowledge base generator having an input receiving a collection
of text-based information sources and which produces a linguistic knowledge base;

an index generator having inputs receiving a collection of text-based information sources
and the linguistic knowledge base and which produces an index of the received text-based
information and further which updates the linguistic knowledge base to reflect the inputs to the
index generator and maintain correlation between the index and the linguistic knowledge base;
and

a query processor having inputs receiving a query composed by an operator, the linguistic
knowledge base, the index and a thesaurus and which produces a list of locations in the
collection of text-based information sources relevant to the query.



2. In a text-based information processing system, an automatic linguistic knowledge base
generator, comprising:

a parser, receiving an input stream of terms and producing individual terms;

a language recognizer connected to receive the individual terms from the parser and
which produces an output indicative of a language to which each individual term belongs;

a normalizer connected to receive the individual terms and further connected to receive
linguistic rules for the language indicated by the output of the language recognizer and producing
normalized terms; and

a linguistic expander connected to receive the normalized terms and producing entries
stored in the linguistic knowledge base.

3. The system of claim 2, wherein the normalizer further comprises:

a first normalizer unit connected to receive the individual terms and the linguistic rules
and producing terms from which illegal characters have been removed; and

a second normalizer unit connected to receive the terms from which illegal characters
have been removed and the linguistic rules and which produces normalized terms including word
stems found by applying the linguistic rules to the terms from which illegal characters have been
removed.

4. In a text-based information processing system, an automatic indexer, comprising:

a parser, receiving an input stream of terms and producing individual terms;

a language recognizer connected to receive the individual terms from the parser and
which produces an output indicative of a language to which each individual term belongs;

a normalizer connected to receive the individual terms and further connected to receive
linguistic rules for the language indicated by the output of the language recognizer and producing
normalized terms; and

an index generator having inputs receiving a collection of text-based information sources
and the linguistic knowledge base and which produces an index of the received text-based
information and further which updates the linguistic knowledge base to reflect the inputs to the
index generator and maintain correlation between the index and the linguistic knowledge base.

5. The system of claim 4, wherein the normalizer further comprises:

a first normalizer unit connected to receive the individual terms and the linguistic rules
and producing terms from which illegal characters have been removed; and

a second normalizer unit connected to receive the terms from which illegal characters
have been removed and the linguistic rules and which produces normalized terms including word
stems found by applying the linguistic rules to the terms from which illegal characters have been
removed.

6. In a text-based information processing system, an expansion unit for expanding terms
in a language, comprising:

an associative expander having an input receiving a term and having an output
representing the term and at least one associated term found by the associative expander making
reference to a thesaurus; and

a linguistic expander having an input connected to the output of the associative expander
and having an output representing the input of the linguistic expander and at least one term
linguistically related to the input of the linguistic